Abstract

Discrete mixture models are routinely used for density estimation and clustering. While conducting inferences on the cluster-specific parameters, current frequentist and Bayesian methods often encounter problems when clusters are placed too close together to be scientifically meaningful. Current Bayesian practice generates component-specific parameters independently from a common prior, which tends to favor similar components and often leads to substantial probability assigned to redundant components that are not needed to fit the data. As an alternative, we propose to generate components from a repulsive process, which leads to fewer, better separated and more interpretable clusters. We characterize this repulsive prior theoretically and propose a Markov chain Monte Carlo sampling algorithm for posterior computation. The methods are illustrated using simulated data as well as real datasets.

Key Words: Bayesian nonparametrics; Dirichlet process; Gaussian mixture model; Model-based clustering; Repulsive point process; Well separated mixture

1 Introduction

Finite mixture models characterize the density of $y \in \mathcal{Y} \subset \mathbb{R}^m$ as

$$f(y) = \sum_{h=1}^{k} p_h \phi(y; \gamma_h), \qquad (1)$$

where $p = (p_1, \dots, p_k)^T$ is a vector of probabilities summing to one, and $\phi(\cdot; \gamma)$ is a kernel depending on parameters $\gamma \in \Gamma$, which may consist of location and scale parameters. There is a rich literature on inference for finite mixture models from both a frequentist (Figueiredo & Jain (2002); Muthén & Shedden (1999)) and Bayesian (Richardson & Green, 1997) perspective.



In analyses of finite mixture models, a common concern is over-fitting in which redundant mixture components having similar locations and scales are introduced. Over-fitting can have an adverse impact on density estimation, since this leads to an unnecessarily complex model. Another common goal of finite mixture modeling is clustering (Fraley & Raftery, 2002), and having components with similar locations leads to overlapping kernels and lack of interpretability. Introducing kernels with similar locations but different scales may be necessary to fit heavy-tailed and skewed densities, and hence low separation in clustering and over-fitting are distinct problems. This article develops a repulsive mixture modeling approach which can be applied to both these problems.

Recently, Rousseau & Mengersen (2011) studied the asymptotic behavior of the posterior distribution in over-fitted Bayesian mixture models having more components than needed. They showed that a carefully chosen prior will lead to asymptotic emptying of the redundant components. However, several challenging practical issues arise. For their prior and in standard Bayesian practice, one assumes that $\gamma_h \sim P_0$ independently a priori. For example, if we consider a finite location-scale mixture of multivariate Gaussians, one may choose $P_0$ to be multivariate Gaussian-inverse Wishart. However, the behavior of the posterior can be sensitive to $P_0$ for finite samples, with higher variance $P_0$ favoring allocation to fewer clusters. In addition, drawing the component-specific parameters from a common prior tends to favor components located close together unless the variance is high.

Regardless of the specific $P_0$ chosen, for small to moderate sample sizes, the weight
assigned to redundant components is often substantial. This can be attributed to identifia- 
bility problems that arise from a difficulty in distinguishing between models that partition 
each of a small number of well separated components into a number of essentially identical 
components. This issue leads to substantial uncertainty in clustering and estimation of the 
number of components, and is not specific to over-fitted mixture models; similar behavior 
occurs in placing a prior on k or using a nonparametric Bayes approach such as the Dirichlet 
process. 

The problem of separating components has been studied for Gaussian mixture models (Dasgupta (1999); Dasgupta & Schulman (2007)). Two Gaussians can be separated by placing an arbitrarily chosen lower bound on the distance between their means. Separated Gaussians have been mainly utilized to speed up convergence of the Expectation-Maximization
(EM) algorithm. In choosing a minimal separation level, it is not clear how to obtain a 
good compromise between values that are too low to solve the problem and ones that are 
so large that one obtains a poor fit. As an alternative, we propose a repulsive prior that 
discourages closeness among component-specific parameters without a hard constraint. 

In contrast to the vast majority of the recent Bayesian literature on discrete mixture models, instead of drawing the component-specific parameters $\{\gamma_h\}$ independently from a common prior $P_0$, we propose a joint prior for $\{\gamma_1, \dots, \gamma_k\}$ that is chosen to assign low density to parameters located close together. We consider two types of repulsive priors: (i) priors guarding against over-fitting by penalizing redundant kernels having close to identical locations and scales, and (ii) priors discouraging closeness in only the locations to favor well separated clusters.



2 Bayesian Repulsive Mixture Models 
2.1 Background on Bayesian mixture modeling 

Considering the finite mixture model in expression (1), a Bayesian specification is completed by choosing priors for the number of components $k$, the probability weights $p$, and the component-specific parameters $\gamma = (\gamma_1, \dots, \gamma_k)^T$. Typically, $k$ is assigned a Poisson or multinomial prior, $p$ a Dirichlet($\alpha$) prior with $\alpha = (\alpha_1, \dots, \alpha_k)^T$, and $\gamma_h \sim P_0$ independently, with $P_0$ often chosen to be conjugate to the kernel $\phi$. Posterior computation can proceed via a reversible jump Markov chain Monte Carlo algorithm involving moves for adding or deleting mixture components. Unfortunately, in making a $k \to k + 1$ change in model dimension, efficient moves critically depend on the choice of proposal density.

It has become popular to use over-fitted mixture models in which $k$ is chosen as a conservative upper bound on the number of components under the expectation that only relatively few of the components will be occupied by subjects in the sample. As motivated in Ishwaran & Zarepour (2002), simply letting $\alpha_h = c/k$ for $h = 1, \dots, k$ and a constant $c > 0$ leads to an approximation to a Dirichlet process mixture model for the density of $y$, which is obtained in the limit as $k$ approaches infinity. An alternative finite approximation to a Dirichlet process mixture is obtained by truncating the stick-breaking representation of Sethuraman (1994), leading to a similarly simple Gibbs sampling algorithm (Ishwaran & James, 2001). These approaches are now used routinely in practice.
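The over-fitted specification above can be illustrated with a minimal sketch that draws the weights from a Dirichlet($c/k, \dots, c/k$) prior. The function names, the seed, and the values $k = 20$, $c = 1$ and the 0.01 threshold are illustrative assumptions, not choices made in the paper.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw from Dirichlet(alpha) by normalizing independent Gamma(alpha_h, 1) variates."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def overfitted_weights(k, c, rng):
    """Weights p ~ Dirichlet(c/k, ..., c/k) for a mixture with k as an upper bound."""
    return sample_dirichlet([c / k] * k, rng)

rng = random.Random(0)  # fixed seed, chosen only for reproducibility
p = overfitted_weights(k=20, c=1.0, rng=rng)
occupied = sum(1 for w in p if w > 0.01)  # components with non-negligible weight
print("weights sum to", round(sum(p), 12))
print("components with weight above 0.01:", occupied)
```

With small $c/k$ most of the prior mass typically falls on a few components, which is the behavior the over-fitted approach relies on.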



2.2 Repulsive densities 

We seek a prior on the component parameters in (1) that automatically favors spread out components near the support of the data. Instead of generating the atoms $\gamma_h$ independently from $P_0$, one could generate them from a repulsive process that automatically pushes the atoms apart. This idea is conceptually related to the literature on repulsive point processes (Huber & Wolpert, 2009). In the spatial statistics literature, a variety of repulsive processes have been proposed. One such model assumes that points are clustered spatially, with the vector of cluster centers $\gamma$ having a Strauss density (Lawson & Clark, 2002), that is $p(k, \gamma) \propto \beta^k \rho^{r(\gamma)}$, where $k$ is the number of clusters, $\beta > 0$, $0 < \rho < 1$ and $r(\gamma)$ is the number of pairs of centers that lie within a pre-specified distance $r$ of each other. A possibly unappealing feature is that repulsion is not directly dependent on the pairwise distances between the clusters. We propose an alternative class of priors, which smoothly push apart components based on their pairwise distances.
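To make the remark about the Strauss density concrete, the following sketch evaluates the unnormalized density $\beta^k \rho^{r(\gamma)}$ for two configurations. The helper names and the numerical values are hypothetical and chosen only for illustration.

```python
import math
from itertools import combinations

def close_pairs(centers, r):
    """r(gamma): number of pairs of centers within Euclidean distance r of each other."""
    return sum(1 for a, b in combinations(centers, 2) if math.dist(a, b) <= r)

def strauss_unnormalized(centers, beta, rho, r):
    """Unnormalized Strauss density beta^k * rho^{r(gamma)}, with beta > 0 and 0 < rho < 1."""
    k = len(centers)
    return beta ** k * rho ** close_pairs(centers, r)

# Two configurations that differ only in how close the first two centers are.
tight = [(0.0, 0.0), (0.1, 0.0), (3.0, 3.0)]
looser = [(0.0, 0.0), (0.4, 0.0), (3.0, 3.0)]
print(strauss_unnormalized(tight, beta=1.0, rho=0.5, r=0.5))
print(strauss_unnormalized(looser, beta=1.0, rho=0.5, r=0.5))
```

Both configurations contain exactly one pair within distance $r$, so they receive the same density even though one pair is four times closer than the other. This is the sense in which the repulsion does not depend directly on the pairwise distances.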

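Definition 1. A density $h(\gamma)$ is repulsive if for any $\delta > 0$ there is a corresponding $\epsilon > 0$ such that $h(\gamma) < \delta$ for all $\gamma \in \Gamma \setminus G_\epsilon$, where $G_\epsilon = \{\gamma : d(\gamma_s, \gamma_j) > \epsilon; s = 1, \dots, k; j < s\}$ and $d$ is a distance.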

We consider two special cases: (i) $d(\gamma_s, \gamma_j)$ is the distance between the $s$th and $j$th kernel, (ii) $d(\gamma_s, \gamma_j)$ is the distance between sub-vectors of $\gamma_s$ and $\gamma_j$ corresponding to only locations. Priors following definition 1(i) limit over-fitting in density estimation, while priors following definition 1(ii) favor well-separated clusters.

As a convenient class of repulsive priors which smoothly push components apart, we propose

$$\pi(\gamma) = c_1 \left\{ \prod_{j=1}^{k} g_0(\gamma_j) \right\} h(\gamma), \qquad (2)$$

with $c_1$ being a normalizing constant that can be intractable to calculate. The dependence of $c_1$ on $k$ leads to complications in estimating $k$ that motivate the use of an over-specified mixture that treats $k$ as an upper bound on the number of components. The proposed prior is closely related to a class of point processes from the statistical physics and spatial statistics literature called Gibbs processes (Daley & Vere-Jones, 2008). We assume $g_0 : \Gamma \to [0, \infty)$ and $h : \Gamma^k \to [0, \infty)$ are continuous with respect to Lebesgue measure, and $h$ is bounded above by a positive constant $c_2$ and is repulsive according to definition 1 with $d$ differing across cases. It follows that density $\pi$ defined in (2) is also repulsive. For location-scale kernels, let $\gamma_j = (\mu_j, \Sigma_j)$ and $g_0(\mu_j, \Sigma_j) = \xi(\mu_j)\psi(\Sigma_j)$ with $\mu_j$ and $\Sigma_j$ being respectively the location and the scale parameters. A special hardcore repulsion is produced if the repulsion function is zero when at least one pairwise distance is smaller than a pre-specified threshold. Such a density implies choosing a minimal separation level between the atoms.

Figure 1: Contour plots of the repulsive prior $\pi(\gamma_1, \gamma_2)$ satisfying definition 1(ii) under (2), either (3) or (4), and (5) with hyperparameters $(\tau, \nu)$ equal to (I) (1, 2), (II) (1, 4), (III) (5, 2) and (IV) (5, 4)

We avoid hard separation thresholds by considering repulsive priors that smoothly push components apart. In particular, we propose two repulsion functions defined as

$$h(\gamma) = \prod_{(s,j) \in A} g\{d(\gamma_s, \gamma_j)\} \quad (3) \qquad\qquad h(\gamma) = \min_{(s,j) \in A} g\{d(\gamma_s, \gamma_j)\} \quad (4)$$

with $A = \{(s, j) : s = 1, \dots, k; j < s\}$ and $g : \mathbb{R}_+ \to [0, M]$ a strictly monotone differentiable function with $g(0) = 0$, $g(x) > 0$ for all $x > 0$ and $M < \infty$. It is straightforward to show that $h$ in (3) and (4) is integrable and satisfies definition 1. The two alternative repulsion functions differ in their dependence on the relative distances between components, with all the pairwise distances playing a role in (3), while (4) only depends on the minimal separation. A flexible choice of $g$ corresponds to

$$g\{d(\gamma_s, \gamma_j)\} = \exp\left[ -\tau \{d(\gamma_s, \gamma_j)\}^{-\nu} \right], \qquad (5)$$

where $\tau > 0$ is a scale parameter and $\nu$ is a positive integer controlling the rate at which $g$ approaches zero as $d(\gamma_s, \gamma_j)$ decreases. Figure 1 shows contour plots of the prior $\pi(\gamma_1, \gamma_2)$ defined as (2) and satisfying definition 1(ii) with $\gamma_1, \gamma_2 \in \mathbb{R}$, $d$ the Euclidean distance, $g_0$ the standard normal density, the repulsive function defined as (3) or (4) and $g$ defined as (5) for different values of $(\tau, \nu)$. As $\tau$ and $\nu$ increase, the prior increasingly favors well separated components.
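The construction in (2) to (5) can be evaluated directly, up to the constant $c_1$. The sketch below assumes scalar locations, Euclidean distance and a standard normal $g_0$, matching the setting described for Figure 1; the function names are hypothetical.

```python
import math
from itertools import combinations

def g(d, tau, nu):
    """Equation (5): exp(-tau * d^(-nu)), with g(0) = 0 taken as the limit."""
    if d <= 0.0:
        return 0.0
    try:
        return math.exp(-tau * d ** (-nu))
    except OverflowError:  # d so small that d^(-nu) overflows; the limit is 0
        return 0.0

def h_product(locs, tau, nu):
    """Repulsion function (3): product of g over all pairs."""
    out = 1.0
    for a, b in combinations(locs, 2):
        out *= g(abs(a - b), tau, nu)
    return out

def h_min(locs, tau, nu):
    """Repulsion function (4): minimum of g over all pairs."""
    return min(g(abs(a - b), tau, nu) for a, b in combinations(locs, 2))

def std_normal_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def unnormalized_prior(locs, tau, nu, h=h_product):
    """Equation (2) without c_1: prod_j g_0(gamma_j) * h(gamma), g_0 standard normal."""
    base = 1.0
    for x in locs:
        base *= std_normal_pdf(x)
    return base * h(locs, tau, nu)

close, far = [0.0, 0.1], [-1.0, 1.0]
for tau, nu in [(1, 2), (5, 4)]:
    print(tau, nu, unnormalized_prior(close, tau, nu), unnormalized_prior(far, tau, nu))
```

Although the pair near the origin has higher density under $g_0$ alone, the repulsion term makes the well separated pair far more likely, and increasingly so as $\tau$ and $\nu$ grow.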
2.3 Theoretical properties

In this section, theoretical properties of the proposed prior are considered under definition 1(ii) for simplicity, though all results can be modified to accommodate definition 1(i). For some results, the kernel will be assumed to depend only on location parameters, while for others on both location and scale parameters. Let $\Pi$ be the prior induced on $\cup_{k=1}^{\infty} \mathcal{F}_k$, where $\mathcal{F}_k$ is the space of all distributions defined as (1). Let $\| \cdot \|_1$ denote the $L_1$ norm and $KL(f_0, f) = \int f_0 \log(f_0/f)$ refer to the Kullback-Leibler (K-L) divergence between $f_0$ and $f$. Density $f_0$ belongs to the K-L support of the prior $\Pi$ if $\Pi\{f : KL(f_0, f) < \epsilon\} > 0$ for all $\epsilon > 0$. Let the true density $f_0 : \mathbb{R}^m \to \mathbb{R}_+$ be defined as $f_0 = \sum_{h=1}^{k_0} p_{0h} \phi(\gamma_{0h})$ with $\gamma_{0h} \in \Gamma$ and the $\gamma_{0j}$ such that there exists an $\epsilon_1 > 0$ such that $\min_{\{(s,j) : s < j\}} d(\gamma_{0s}, \gamma_{0j}) \ge \epsilon_1$, $d$ being the Euclidean distance of sub-vectors of $\gamma_{0j}$ and $\gamma_{0s}$ corresponding to only locations. Let $f = \sum_{h=1}^{k} p_h \phi(\gamma_h)$ with $\gamma_h \in \Gamma$. Let $\gamma \sim \pi$ and $\pi$ satisfy definition 1(ii). Let $p \sim \lambda$ with $\lambda = $ Dirichlet($\alpha$) and $k \sim \vartheta$ with $\vartheta(k = k_0) > 0$. Let $\theta = (p, \gamma)$. These assumptions on $f_0$ and $f$ will be referred to as condition B0. The next lemma provides sufficient conditions under which the true density is in the K-L support of the prior for location kernels.

Lemma 1. Assume condition B0 is satisfied with $m = 1$. Let $D_0$ be a compact set containing location parameters $(\gamma_{01}, \dots, \gamma_{0k_0})$. Let $\phi$ and $\pi$ satisfy the following conditions:

A1. for any $y \in \mathcal{Y}$, the map $\gamma \to \phi(y; \gamma)$ is uniformly continuous

A2. for any $y \in \mathcal{Y}$, $\phi(y; \gamma)$ is bounded above by a constant

A3. $\int f_0 \left| \log\{\sup_{\gamma \in D_0} \phi(\gamma)\} - \log\{\inf_{\gamma \in D_0} \phi(\gamma)\} \right| < \infty$

A4. $\pi$ is continuous with respect to Lebesgue measure and for any vector $x \in \Gamma^k$ with $\min_{\{(s,j) : s < j\}} d(x_s, x_j) > \upsilon$ for $\upsilon > 0$ there is a $\delta > 0$ such that $\pi(\gamma) > 0$ for all $\gamma$ satisfying $\|\gamma - x\|_1 < \delta$

Then $f_0$ is in the K-L support of the prior $\Pi$.

Lemma 2. The repulsive density in (2) with $h$ defined as either (3) or (4) satisfies condition A4 in lemma 1.

The next lemma formalizes the posterior rate of concentration for univariate location mixtures of Gaussians.

Lemma 3. Let condition B0 be satisfied, let $m = 1$ and $\phi$ be the normal kernel depending on a location parameter $\mu$ and a scale parameter $\sigma$. Assume that condition (i), (ii) and (iii) of theorem 3.1 in Scricciolo (2011) and assumption A4 in lemma 1 are satisfied. Furthermore, assume that

C1) the joint density $\pi$ leads to exchangeable random variables and for all $k$ the marginal density of $\mu_1$ satisfies $\pi(|\mu_1| \ge t) \lesssim \exp(-q_1 t^2)$ for a given $q_1 > 0$

C2) there are constants $u_1, u_2, u_3 > 0$, possibly depending on $f_0$, such that for any $\epsilon \le u_3$

$$\pi(\|\mu - \mu_0\|_1 \le \epsilon) \ge u_1 \exp\{-u_2 k_0 \log(1/\epsilon)\}$$

Then the posterior rate of convergence relative to the $L_1$ metric is $\epsilon_n = n^{-1/2} \log n$.

Lemma 3 is basically a modification of theorem 3.1 in Scricciolo (2011) to our proposed repulsive mixture model. Lemma 4 gives sufficient conditions for $\pi$ to satisfy condition C1 and C2 in lemma 3.






Lemma 4. Let $\pi$ be defined as (2) and $h$ be defined as either (3) or (4), then $\pi$ satisfies condition C2 in lemma 3. Furthermore, if for a positive constant $u_1$ the function $\xi$ satisfies $\xi(|x| \ge t) \lesssim \exp(-u_1 t^2)$, $\pi$ satisfies condition C1 in lemma 3.

As motivated above, when the number of mixture components is chosen to be conservatively large, it is appealing for the posterior distribution of the weights of the extra components to be concentrated near zero. Theorem 1 formalizes the rate of concentration with increasing sample size $n$. One of the main assumptions required in theorem 1 is that the posterior rate of convergence relative to the $L_1$ metric is $\delta_n = n^{-1/2}(\log n)^q$ with $q \ge 0$. We provided the contraction rate, under the proposed prior specification and univariate Gaussian kernel, in lemma 3. However, theorem 1 is a more general statement and applies to multivariate mixture densities with any kernel.

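Theorem 1. Let assumptions B0-B5 be satisfied. Let $\pi$ be defined as (2) and $h$ be defined as either (3) or (4). If $\bar{\alpha} = \max(\alpha_1, \dots, \alpha_k) < m/2$ and for positive constants $r_1, r_2, r_3$ the function $g$ satisfies $g(x) \le r_1 x^{r_2}$ for $0 \le x < r_3$ then

$$\lim_{M \to \infty} \limsup_{n \to \infty} E_n^0 \left[ \Pi\left\{ \min_{\iota \in S_k} \left( \sum_{i=k_0+1}^{k} p_{\iota(i)} \right) > M n^{-1/2} (\log n)^{q\{1 + s(k_0, \alpha)/s_{r_2}\}} \,\middle|\, y_1, \dots, y_n \right\} \right] = 0$$

with $s(k_0, \alpha) = k_0 - 1 + m k_0 + \bar{\alpha}(k - k_0)$, $s_{r_2} = r_2 + m/2 - \bar{\alpha}$ and $S_k$ the set of all possible permutations of $\{1, \dots, k\}$.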

Theorem 1 is a modification of theorem 1 in Rousseau & Mengersen (2011) to our proposed repulsive mixture model. Theorem 1 implies that the posterior expectation of weights of the extra components is of order $O(n^{-1/2}(\log n)^{q\{1 + s(k_0, \alpha)/s_{r_2}\}})$. When $g$ is defined as (5), parameters $r_1$ and $r_2$ can be chosen such that $r_1 = \tau$ and $r_2 = \nu$.
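The quantity controlled by theorem 1, the minimum over relabelings of the total weight on components $k_0 + 1, \dots, k$, equals the sum of the $k - k_0$ smallest weights. The sketch below computes it both ways for a small example; the function names and the weight vector are illustrative assumptions.

```python
from itertools import permutations

def extra_weight(p, k0):
    """Sum of the k - k0 smallest weights: the minimum over relabelings of the
    total weight placed on components k0+1, ..., k."""
    return sum(sorted(p)[: len(p) - k0])

def extra_weight_brute_force(p, k0):
    """Same quantity by explicit minimization over all permutations (small k only)."""
    k = len(p)
    return min(sum(p[perm[i]] for i in range(k0, k)) for perm in permutations(range(k)))

p = [0.50, 0.30, 0.10, 0.06, 0.04]  # hypothetical posterior draw of the weights
print(extra_weight(p, k0=2), extra_weight_brute_force(p, k0=2))
```

Sorting avoids the factorial cost of enumerating permutations, which matters when the quantity is evaluated at every retained Markov chain Monte Carlo draw.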

When the number of components is unknown, with only an upper bound known, the posterior rate of convergence is equivalent to the parametric rate $n^{-1/2}$ (Ishwaran et al., 2001). In this case, the rate in theorem 1 is $n^{-1/2}$ under usual priors or our repulsive prior. However, in our experience using usual priors, the sum of the extra components can be substantial in small to moderate sample sizes, and often has high variability. As we show in Section 4, for repulsive priors the sum of the extra component weights is close to zero and has small variance for small as well as large sample sizes. When an upper bound on the number of components is unknown, the posterior rate of concentration is $n^{-1/2}(\log n)^q$ with $q > 0$. In this case, according to theorem 1, using our prior specification the logarithmic factor in theorem 1 of Rousseau & Mengersen (2011) can be improved.



3 Parameter Calibration and Posterior Computation 

An important issue in implementing repulsive mixture models is elicitation of the repulsion hyper-parameters $(\tau, \nu)$. Although a variety of strategies can be considered, we propose a simple approach that can be used to obtain a default hyper-parameter choice in general applications. In case (i) we choose $d(\cdot, \cdot)$ as the symmetric Kullback-Leibler divergence defined for Gaussian kernels as

$$d(\gamma_1, \gamma_2) = \mathrm{tr}(\Sigma_1 \Sigma_2^{-1}) + \mathrm{tr}(\Sigma_1^{-1} \Sigma_2) - 2m + (\mu_1 - \mu_2)^T (\Sigma_1^{-1} + \Sigma_2^{-1}) (\mu_1 - \mu_2),$$

while in case (ii) we use the Euclidean distance between the location parameters. For both case (i) and case (ii), define $\bar{d}$ as the mean of pairwise distances between atoms, $\bar{d} = \frac{1}{n(A)} \sum_{(s,j) \in A} d(\gamma_s, \gamma_j)$ with $A = \{(s, j) : s = 1, \dots, k; j < s\}$ and $n(A)$ the cardinality of set $A$. Let $f_1$ and $f_2$ denote the densities of $\bar{d}$ under repulsive and non-repulsive priors respectively, with $(\varrho_j, \sigma_j)$ the mean and standard deviation of $f_j$ for $j = 1, 2$. We choose $(\tau, \nu)$ so that $f_1$ and $f_2$ are well-separated using the following definition of separation (Dasgupta, 1999).

Definition 2. Given a positive constant $c$, $f_1$ and $f_2$ are $c$-separated if $\varrho_1 - \varrho_2 \ge c \max(\sigma_1, \sigma_2)$.

We have found that $\nu = 2$ and $\nu = 1$ provide good default values in case (i) and (ii) respectively and we fix $\nu$ at these values in all our applications below. For a given value of $\nu$, $\tau$ is found by starting with small values, estimating the mean and variance of $\bar{d}$ through Monte Carlo draws, and incrementing $\tau$ until definition 2 is satisfied for a pre-specified $c$. We use $c = 4$ in our implementations.
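The calibration loop just described can be sketched as follows for case (ii) with scalar locations. Several ingredients are assumptions made here for illustration: $g_0$ is standard normal, the repulsion is the product form (3) with $g$ from (5), and moments under the repulsive prior are estimated by reweighting non-repulsive draws with $h(\gamma)$ (self-normalized importance sampling) rather than by sampling the repulsive prior directly. The function names, grid step and cap `tau_max` are hypothetical.

```python
import math
import random
from itertools import combinations

def mean_pairwise_distance(locs):
    """d-bar: mean of |gamma_s - gamma_j| over all pairs j < s."""
    pairs = list(combinations(locs, 2))
    return sum(abs(a - b) for a, b in pairs) / len(pairs)

def repulsion(locs, tau, nu):
    """Product repulsion (3) with g from (5); coincident atoms get weight 0."""
    log_h = 0.0
    for a, b in combinations(locs, 2):
        d = abs(a - b)
        if d == 0.0:
            return 0.0
        try:
            log_h -= tau * d ** (-nu)
        except OverflowError:  # d so small that d^(-nu) overflows
            return 0.0
    return math.exp(log_h)

def weighted_mean_sd(values, weights):
    total = sum(weights)
    mean = sum(w * v for v, w in zip(values, weights)) / total
    var = sum(w * (v - mean) ** 2 for v, w in zip(values, weights)) / total
    return mean, math.sqrt(var)

def calibrate_tau(k, nu, c, rng, n_draws=2000, tau_step=0.5, tau_max=20.0):
    """Smallest tau on the grid for which f_1 and f_2 are c-separated, or None."""
    draws = [[rng.gauss(0.0, 1.0) for _ in range(k)] for _ in range(n_draws)]
    dbar = [mean_pairwise_distance(x) for x in draws]
    mean2, sd2 = weighted_mean_sd(dbar, [1.0] * n_draws)  # non-repulsive prior
    tau = tau_step
    while tau <= tau_max:
        weights = [repulsion(x, tau, nu) for x in draws]
        if sum(weights) > 0.0:
            mean1, sd1 = weighted_mean_sd(dbar, weights)  # repulsive prior, reweighted
            if mean1 - mean2 >= c * max(sd1, sd2):
                return tau
        tau += tau_step
    return None

tau = calibrate_tau(k=3, nu=1, c=4.0, rng=random.Random(1))
print("calibrated tau:", tau)
```

The function returns `None` when the criterion is not met on the grid. Reweighting becomes unreliable when the repulsion is strong, because few draws carry most of the weight, so a dedicated sampler for the repulsive prior may be preferable in practice.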



For posterior computation, we use a slice sampling algorithm (Neal, 2003), a class of Markov chain Monte Carlo algorithms widely used for posterior inference in infinite mixture models (Kalli et al., 2011). Letting $g_0$ be a conjugate prior, introduce a latent variable $u$ which is jointly modeled with $\gamma$ through

$$\pi(\gamma_1, \dots, \gamma_k, u) \propto \left\{ \prod_{h=1}^{k} g_0(\gamma_h) \right\} 1\{h(\gamma_1, \dots, \gamma_k) > u\}.$$

Here $1(B)$ is the indicator function, equalling 1 if the event $B$ occurs and 0 otherwise. Marginalizing out $u$, we recover the original density $\pi(\gamma_1, \dots, \gamma_k)$. For a repulsion function defined as (4), let $B_j = \cap_{s : s \ne j} [\gamma_j : g\{d(\gamma_s, \gamma_j)\} > u]$. As long as $g$ is invertible in its argument, the set $B_j$ can be calculated, making sampling straightforward. When the repulsion function is defined as (3), one can introduce a latent variable for each product term. Under repulsive priors satisfying definition 1(i), the set $B_j$ might not be easy to compute. However, when covariance matrices are constrained to be diagonal, the vectors $\gamma_j$ can be easily sampled element-wise. For multivariate observations, the location parameter vector can be sampled element-wise from truncated distributions. Details can be found in the supplementary materials.
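A minimal sketch of the latent-variable scheme, restricted to sampling from the prior itself, may help fix ideas. It assumes scalar locations, a standard normal $g_0$, the min-type repulsion (4) and $g$ from (5), and it draws from the truncated normal by simple rejection; a posterior sampler would replace $g_0$ by the conjugate full conditional of each atom. The function names and the `max_tries` safeguard are hypothetical.

```python
import math
import random

def g(d, tau, nu):
    """Equation (5), with g(0) = 0."""
    if d <= 0.0:
        return 0.0
    try:
        return math.exp(-tau * d ** (-nu))
    except OverflowError:
        return 0.0

def g_inverse(u, tau, nu):
    """Radius r with g(r) = u for 0 < u < 1, so that g(d) > u exactly when d > r."""
    return (tau / -math.log(u)) ** (1.0 / nu)

def min_distance(locs):
    return min(abs(locs[s] - locs[j]) for s in range(len(locs)) for j in range(s))

def slice_sweep(locs, tau, nu, rng, max_tries=10000):
    """One sweep for the target prod_j N(gamma_j; 0, 1) * min_{j<s} g(|gamma_s - gamma_j|)."""
    locs = list(locs)
    h = g(min_distance(locs), tau, nu)   # g is increasing, so h depends on the closest pair
    if h <= 0.0:
        raise ValueError("atoms must be distinct and not numerically coincident")
    u = h * (1.0 - rng.random())         # u | gamma ~ Uniform(0, h(gamma)]
    r = g_inverse(u, tau, nu)
    for j in range(len(locs)):
        others = locs[:j] + locs[j + 1:]
        for _ in range(max_tries):       # rejection sampling from g_0 restricted to B_j
            x = rng.gauss(0.0, 1.0)
            if all(abs(x - o) > r for o in others):
                locs[j] = x
                break
        # if no proposal is accepted, the current value (already in B_j) is kept
    return locs, r

rng = random.Random(0)
state = [-2.0, 0.0, 2.0]
for _ in range(100):
    state, _ = slice_sweep(state, tau=1.0, nu=2, rng=rng)
print("atoms after 100 sweeps:", [round(x, 3) for x in state])
```

Because $g$ is invertible, the constraint $g\{d(\gamma_s, \gamma_j)\} > u$ reduces to keeping every atom at distance greater than $r = g^{-1}(u)$ from the others, which is what makes the set $B_j$ easy to handle here.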



4 Synthetic Examples 

Simulation examples were considered to assess the performance of the repulsive prior in density estimation, clustering and emptying of extra components. Figure 2 plots the true densities in the various cases that we considered. For each synthetic dataset, repulsive and non-repulsive mixture models were compared considering a fixed upper bound on the number of components; extra components should be assigned small probabilities and hence effectively excluded. The slice sampler was run for 10,000 iterations with a burn-in of 5,000. The chain was thinned by keeping every 10th draw. To overcome the label switching problem, the samples were post-processed following the algorithm of Stephens (2000). Details on parameters involved in the true densities, choice of prior distributions and methods used to compute quantities presented in this section can be found in the supplement.

Figure 2: (I) Standard normal density (solid), two-component mixture of normals sharing the same location parameter (dash) and Student's t density (dash-dot), referred as (Ia, Ib, Ic), (II) two-components mixture of poorly (solid) and well separated (dot-dash) Gaussian densities, referred as (IIa, IIb), (III) mixture of poorly (dot-dash) and well separated (solid) Gaussian and Pearson densities, referred as (IIIa, IIIb), (IV) two-components mixture of two-dimensional non-spherical Gaussians

Repulsive mixtures satisfying definition 1(i) and non-repulsive mixtures were compared. For this experiment 1,000 draws from a standard normal density and from a two-component mixture of overlapping normals were considered. Both repulsive and non-repulsive mixtures were run considering six as the upper bound of the number of components. Table 1 shows posterior summaries of parameters involved in the components with highest weights. Clearly, repulsive mixtures lead to a more parsimonious representation of the true densities and more accurate parameter estimates. The mean and standard deviation of the K-L divergence under the first data example were (0.003, 0.002) and (0.004, 0.002) for non-repulsive and repulsive mixtures respectively, while under the second data example they were (0.006, 0.003) and (0.009, 0.003) for non-repulsive and repulsive mixtures respectively. Therefore, repulsive mixtures were able to concentrate more on the reduced model while performing similarly to non-repulsive mixtures in estimating the true density.

Repulsive mixtures satisfying definition 1(ii) and non-repulsive mixtures were compared to assess clustering performance. Table 2 shows summary statistics of the K-L divergence, the misclassification error and the sum of extra weights under repulsive and non-repulsive mixtures with six mixture components as the upper bound. Table 2 also shows the misclassification error resulting from hierarchical clustering (?). In practice, observations drawn from the same mixture component were considered as belonging to the same category and for each dataset a similarity matrix was constructed. The misclassification error was established in terms of divergence between the true similarity matrix and the posterior similarity matrix. As shown in table 2, the K-L divergences under repulsive and non-repulsive mixtures become more similar as the sample size increases. For smaller sample sizes, the results are more similar when components are very well separated. Since a repulsive prior tends to discourage overlapping mixture components, a repulsive model might not estimate the density quite as accurately when a mixture of closely overlapping components is needed. However, as the sample size increases, the fitted density approaches the true density regardless of the degree of closeness among clusters. Again, though repulsive and non-repulsive mixtures perform similarly in estimating the true density, repulsive mixtures place considerably less probability on extra components, leading to more interpretable clusters. In terms of misclassification error, the repulsive model outperforms the other two approaches while, in most cases, the worst performance was obtained by the non-repulsive model.
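The similarity-matrix comparison can be sketched as below. The paper defers the exact divergence to the supplement, so the mean absolute difference over pairs used here is an assumption, as are the function names and the toy label vectors.

```python
def posterior_similarity(label_draws):
    """Entry (i, j): fraction of MCMC draws in which observations i and j share a component."""
    n = len(label_draws[0])
    counts = [[0] * n for _ in range(n)]
    for labels in label_draws:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    return [[c / len(label_draws) for c in row] for row in counts]

def similarity_error(true_labels, label_draws):
    """Mean absolute difference between true and posterior similarity over pairs j < i
    (an assumed divergence; the paper's choice is given in its supplement)."""
    post = posterior_similarity(label_draws)
    n = len(true_labels)
    diffs = [abs(float(true_labels[i] == true_labels[j]) - post[i][j])
             for i in range(n) for j in range(i)]
    return sum(diffs) / len(diffs)

truth = [0, 0, 1, 1]
print(similarity_error(truth, [[0, 0, 1, 1], [1, 1, 0, 0]]))  # same partition, labels switched
print(similarity_error(truth, [[0, 1, 0, 1]]))                # a different partition
```

Working with co-clustering indicators rather than raw labels makes the error invariant to label switching, since only whether two observations share a component matters.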

Potentially, one may favor fewer clusters, and hence possibly better separated clusters, 
by penalizing the introduction of new clusters more through modifying the precision in the 
Dirichlet prior for the weights; in the supplemental materials, we demonstrate that this 
cannot solve the problem. 

Table 1: Posterior mean and standard deviation of weights, location and scale parameters under dataset drawn from densities (Ia, Ib)

Table 2: Mean and standard deviation of K-L divergence, misclassification error and sum of extra weights resulting from non-repulsive mixture and repulsive mixture with a maximum number of clusters equal to six under different synthetic data scenarios.

Figure 3: Histogram of galaxy data (I) and acidity data (II) overlaid with a nonparametric density estimate using Gaussian kernel density estimation



5 Real data 

We tested the performance of our proposed prior specification on three real datasets. The first involves 82 measurements of the velocities in km/s of galaxies diverging from our own (Escobar & West (1995); Richardson & Green (1997)), the second consists of the acidity index measured in a sample of 155 lakes in north central Wisconsin (Richardson & Green (1997)), and the third consists of 150 observations from three different species of iris each with four measurements (?).

For the first two datasets, a repulsive mixture satisfying definition 1(i) was considered and a five-component mixture model was fit, while for the third dataset a repulsive mixture satisfying definition 1(ii) was considered and both six components and ten components were considered as the upper bound. The same prior specification, Markov chain Monte Carlo sampler, and relabeling technique as in section 4 were utilized.

For the galaxy data, figure 3 reveals that there are three non-overlapping clusters with the one close to the origin relatively large compared to the others. Although this large cluster might be interpreted as two highly overlapping clusters, it appears to be well approximated by a single normal density. Richardson & Green (1997) and Escobar & West (1995) estimated the number of components, obtaining a posterior distribution on $k$ concentrating on values ranging from 5 to 7. This may be due to the non-repulsive prior allowing closely overlapping components, favoring relatively large values of $k$. Figure 4 reveals that the non-repulsive prior specification leads to two overlapping and essentially indistinguishable clusters. Under repulsive priors, no clusters overlap significantly and unnecessary components receive a weight close to zero.

For the acidity data, figure 3 suggests that two clusters are involved. Since one of them appears to be highly skewed, we expect that three clusters might be needed to approximate this density well. Richardson & Green (1997) obtained a posterior for $k$ almost equally concentrated on values of $k$ ranging from 3 to 5. Figure 4 shows the estimated clusters for both repulsive and non-repulsive priors. With non-repulsive priors, four clusters receive significant weight and two of them overlap significantly. With repulsive priors, only three clusters receive significant weight and all of them appear fairly separated.

Figure 4: Estimated clusters under galaxy data for non-repulsive (I) and repulsive (II) priors and under acidity data for non-repulsive (III) and repulsive (IV) priors

The iris data were previously analyzed by ? and ? using new methods to estimate the number of clusters based on minimizing loss functions. They concluded the optimal number of clusters was two. This result did not agree with the number of species due to low separation in the data between two of the species. Such point estimates of the number of clusters do not provide a characterization of uncertainty in clustering, in contrast to Bayesian approaches. Repulsive and non-repulsive mixtures were fitted under different choices of upper bound on the number of components. Since the data contains three true biological clusters, with two of these having similar distributions of the available features, we would expect the posterior to concentrate on two or three components. Posterior means and standard deviations of the three highest weights were (0.30, 0.23, 0.13) and (0.05, 0.04, 0.04) for non-repulsive and (0.56, 0.29, 0.08) and (0.05, 0.04, 0.03) for repulsive. Clearly, repulsive priors lead to a posterior more concentrated on two components, and assign low probability to more than three components. Figure 5 shows the density of the total probability assigned to the extra components. This quantity was computed considering the number of species as the true number of clusters. According to figure 5, our repulsive prior specification leads to extra component weights very close to zero regardless of the upper bound on the number of components. The posterior uncertainty is also small. Non-repulsive mixtures assign large weight to extra components, with posterior uncertainty increasing considerably as the number of components increases.

Figure 5: Density of sum of extra weights under k=6 for non-repulsive (solid) and repulsive (dash) and k=10 components for non-repulsive (dash-dot) and repulsive (dot)



Supplementary Material 



Supplementary material includes the proof of lemma 2 and lemma 4, assumptions B1-B5, conditions (i), (ii) and (iii) of theorem 3.1 in Scricciolo (2011) and theorem 2.1 in Ghosal et al. (2000).

Appendix

Throughout the appendix we write all constants whose values are of no consequence to be 
equal to 1. 

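Proof of lemma 1. By assumption B0, $\vartheta(k = k_0) > 0$. We consider the case $f$ is a finite mixture with $k_0$ components. By assumption A1, for each $\eta > 0$ there is a corresponding $\delta > 0$ such that, for any given $y \in \mathcal{Y}$ and for all $\gamma_1, \gamma_2 \in \Gamma$ with $|\gamma_1 - \gamma_2| < \delta$, we have that $|\phi(y; \gamma_1) - \phi(y; \gamma_2)| < \eta$. Let $S_\delta = P_\delta \times \Gamma_\delta$ with $\Gamma_\delta = \{\gamma : |\gamma_j - \gamma_{0j}| \le \delta, j \le k_0\}$ and $P_\delta = \{p : |p_j - p_{0j}| \le \delta, j \le k_0\}$. By assumption A1 and A2, for any given $y$ and for any $\eta > 0$, there is a $\delta > 0$ such that $|f_0 - f| \le \eta$ if $\theta \in S_\delta$. This means that $f \to f_0$ as $\theta \to \theta_0$ for any given $y$. Equivalently, we can say that $|\log(f_0/f)| \to 0$ pointwise as $\theta \to \theta_0$. Notice that

$$|\log(f_0/f)| \le \left| \log\left\{ \sup_{\gamma \in D_0} \phi(\gamma) \right\} - \log\left\{ \inf_{\gamma \in D_0} \phi(\gamma) \right\} \right|$$

By assumption A3 and applying the dominated convergence theorem, for any $\epsilon > 0$ there is a $\delta > 0$ such that $\int f_0 \log(f_0/f) < \epsilon$ if $\theta \in S_\delta$. By the independence of the weights and the parameters of the kernel,

$$\Pi(KL(f_0, f) < \epsilon) \ge \lambda(P_\delta)\, \pi(\Gamma_\delta)$$

Assumption A4 combined with the fact that $\{\gamma : \|\gamma - \gamma_0\|_1 \le \delta\} \subset \Gamma_\delta$ results in $\pi(\Gamma_\delta) > 0$. Finally, since $\lambda = $ Dirichlet($\alpha$), it can be shown that $\lambda(P_\delta) > 0$. □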

Proof of lemma 3. To prove lemma 3 we need to show that the three conditions of theorem 2.1 in Ghosal et al. (2000) are satisfied. First, define $D(\epsilon, \mathcal{F}, d_s)$ as the maximum number of points in $\mathcal{F}$ such that the distance, with respect to metric $d_s$, between each pair is at least $\epsilon$. Let $d_s$ be either the Hellinger metric or the one induced by the $L_1$-norm. For given sequences $k_n, a_n, u_n \uparrow \infty$ and $b_n \downarrow 0$ define



and 



'-'i— 1-' n 



As it is shown in Scricciolol ( 2011 ). for co nstants .f2 > . fi > and 



h,h,h > 0, derived below to satisfy condition (2) and (3) in Ghosal et al. ( 20001 ) . de- 



fine /i log n < kr, 
logD{en,Tn,ds 



< /2logn, Or, 



n 



-1/2 lo: 



log e„l)^^^ 



bn = /i(loge-i)-i/^2 and 



-h 



,n. 



Let $A_{n,j} = [-a_n, a_n]^j$. In order to show condition (2) of theorem 2.1 in Ghosal et al. (2000), we need to show that there is a constant $q_1 > 0$ such that $\pi(A_{n,j}^c) \lesssim \exp(-q_1 a_n^2)$. From the exchangeability assumption it follows



PriAZ,\k = s) =E, 



(^-1)! 



tVT 



Therefore, condition C1 implies that, for a positive constant $q_1$, we have $\pi(A_{n,j}^c) \lesssim E(k) \exp(-q_1 a_n^2)$ with $E(k) < \infty$ by condition (ii) of theorem 3.1 in Scricciolo (2011).



Given a positive constant Z2 chosen to satisfy condition (3) in theorem 2.1 of I Ghosal et al. 
(H), let /i > (Z2+4)M, h < {ei/4(z2 +4)}^/^^j2^4(z2+4Ve3 and/3 > {4(z2 + 4)/gi} 



1/2 



Under these values of the constants, following Scricciolo (2011), assumptions (i), (ii) of theorem 3.1 in Scricciolo (2011) combined with assumption C1 imply $\Pi(\mathcal{F} \setminus \mathcal{F}_n) \lesssim \exp\{-(z_2 + 4) n \epsilon_n^2\}$ with $\epsilon_n = n^{-1/2}(\log n)^{1/2}$.



To show condition (3) of theorem 2.1 in Ghosal et al. (2000), we can again follow the proof of theorem 3.1 in Scricciolo (2011). The only thing we need to show is that there are constants $u_1, u_2, u_3 > 0$ such that for any $\epsilon_n \le u_3$

$$\pi(\|\mu - \mu_0\|_1 \le \epsilon_n) \ge u_1 \exp\{-u_2 k_0 \log(1/\epsilon_n)\}$$

that is guaranteed by condition C2. Therefore, it can be easily shown that, for sufficiently large $n$, $z_2 > 0$ and $\epsilon_n = n^{-1/2}(\log n)^{1/2}$, $\Pi\{B_{KL}(f_0, \epsilon_n^2)\} \gtrsim \exp(-z_2 n \epsilon_n^2)$. □

Proof of theorem 1. Only for this proof and for ease of notation the density $f$ will be referred as $f_\theta$. Define the non identifiability set as $T = \{\theta : f_\theta = f_0\}$. In order to define each vector in $T$, let $0 = t_0 < t_1 < t_2 < \dots < t_{k_0} \le k$ and $\gamma_j = \gamma_{0i}$ for $j \in I_i = \{t_{i-1} + 1, \dots, t_i\}$. Let $p_{0i} = \sum_{j = t_{i-1}+1}^{t_i} p_j$ and $p_j = 0$ for $j > t_{k_0}$. Define $q_j = p_j / p_{0i}$ for $j \in I_i$.

Define $A_n = \{\min_{\iota \in S_k} (\sum_{i=k_0+1}^{k} p_{\iota(i)}) > \delta_n M_n\}$ and $A'_n = A_n \cap \{\|f - f_0\|_1 \le M \delta_n\}$. Let $D_n = \int_{\{\|f - f_0\|_1 \le \delta_n\}} \exp(l_n(\theta) - l_n(\theta_0)) \, d(\pi \times \lambda)(\theta)$ with $l_n(\theta_0)$ being the log-likelihood evaluated at $\theta_0$. Along the line of the proof of Rousseau & Mengersen (2011), to prove theorem 1 we need to show that for any $\epsilon > 0$ there are positive constants $m_1, m_2$ and a permutation $\iota \in S_k$ such that

$$D_n \ge m_1 n^{-s(k_0, \alpha)/2} \qquad (6)$$

$$\Pi(A'_n) \le m_2\, \delta_n^{s(k_0, \alpha)} M_n^{\bar{\alpha} - m/2} \qquad (7)$$

with $s(k_0, \alpha) = k_0 - 1 + m k_0 + \sum_{j=1}^{k - k_0} \alpha_{\iota(j)}$. Following the proof of Rousseau & Mengersen (2011), we can show that, under condition B5, (6) is satisfied for sufficiently large $n$. Concerning (7), Rousseau & Mengersen (2011) showed that on $A'_n$ there is a set $I_i$ containing indices $j_1$ and $j_2$ such that

$$|\gamma_{j_1} - \gamma_{0i}| \le (\delta_n / q_{j_1})^{1/2}, \qquad |\gamma_{j_2} - \gamma_{0i}| \le (\delta_n / q_{j_2})^{1/2}$$

with $q_{j_1} > \epsilon / k_0$ and $q_{j_2} > \dots$ Therefore, from the triangle inequality it follows

$$|\gamma_{j_1} - \gamma_{j_2}| \le \{2 \delta_n / \min(q_{j_1}, q_{j_2})\}^{1/2}$$

Now, for sufficiently large $n$, $\min(q_{j_1}, q_{j_2}) > \delta_n M_n / 2$ and therefore $|\gamma_{j_1} - \gamma_{j_2}| \le M_n^{-1/2}$. Recalling that $g$ is bounded above by a positive constant, there exists a constant $c > 0$ such that

$$h(\gamma) \le c\, g\{d(\gamma_{j_1}, \gamma_{j_2})\} \le c\, g(M_n^{-1/2}) \qquad (8)$$

Let the prior probability of the set $A'_n$ be defined as $\Pi(A'_n) = \int_{A'_n} d(\pi \times \lambda)(\gamma \times p)$. To find an upper bound for this integral, directly apply the proof of Rousseau & Mengersen (2011), showing that $\Pi(A'_n) \le g(M_n^{-1/2})\, \delta_n^{s(k_0, \alpha)} M_n^{\bar{\alpha} - m/2}$. By assumption, for sufficiently large $n$, $g(M_n^{-1/2}) \le r_1 M_n^{-r_2}$. Letting $s_{r_2} = r_2 + m/2 - \bar{\alpha}$, it follows

$$\Pi(A'_n) \le r_1\, \delta_n^{s(k_0, \alpha)} M_n^{-s_{r_2}}$$

Therefore, $M_n = (\log n)^{q\, s(k_0, \alpha)/s_{r_2}}$ implies $\Pi(A'_n) = O_p(D_n)$. □
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